Memory-Efficient Katakana Compound Segmentation using Conditional Random Fields
Authors
Abstract
The absence of explicit word boundary delimiters, such as spaces, in Japanese text complicates Japanese morphological analysis. In particular, out-of-vocabulary words are a serious problem for systems that rely on dictionary data to establish word boundaries. In this paper we present a method for decompounding katakana sequences, one of the main sources of out-of-vocabulary words, using a discriminative model based on Conditional Random Fields. Notable features of the proposed approach are its simplicity and memory efficiency.
Similar Resources
Painless Semi-Supervised Morphological Segmentation using Conditional Random Fields
We discuss data-driven morphological segmentation, in which word forms are segmented into morphs, that is the surface forms of morphemes. We extend a recent segmentation approach based on conditional random fields from purely supervised to semi-supervised learning by exploiting available unsupervised segmentation techniques. We integrate the unsupervised techniques into the conditional random f...
Character Categorization via Latent Dirichlet Allocation for Kana Sequence Segmentation with Conditional Random Fields
We propose an efficient Kana sequence segmentation as a component of faster and easier interfaces for e-learning systems. We assign categories to Kana characters via latent Dirichlet allocation (LDA) and use the categories to compose additional features for conditional random fields (CRF). We compare the categories our method gives and those manually prepared by their efficiency in Kana sequenc...
Studies for Segmentation of Historical Texts: Sentences or Chunks?
We present some experiments on text segmentation for German texts aimed at developing a method of segmenting historical texts. Since such texts have no (consistent) punctuation, we use a machine learning approach to label tokens with their relative positions in text segments using Conditional Random Fields. We compare the performance of this approach on the task of segmenting of text into sente...
Efficient Structured Prediction with Latent Variables for General Graphical Models
In this paper we propose a unified framework for structured prediction with latent variables which includes hidden conditional random fields and latent structured support vector machines as special cases. We describe a local entropy approximation for this general formulation using duality, and derive an efficient message passing algorithm that is guaranteed to converge. We demonstrate its effec...
Broadcast News Story Segmentation Using Conditional Random Fields and Multimodal Features
This paper proposes to integrate multi-modal features using conditional random fields (CRF) for broadcast news story segmentation. We study story boundary cues from lexical, audio and video modalities, where lexical features consist of lexical similarity, chain strength and overall cohesiveness, acoustic features involve pause duration, pitch, speaker change and audio event type, and visual fea...